Lecture 3
NHMRC Clinical Trials Centre, University of Sydney
gillian.heller@sydney.edu.au
Clearly a linear model will be inadequate.
To start, we analyse BMI for boys aged between 10 and 20, a range over which growth in BMI is approximately linear:
\begin{align*} y_i&=\text{ BMI of boy }i, \quad x_i=\text{ age of boy }i\\ y_i&=\beta_0+\beta_1 x_i+\epsilon_i\\ \epsilon_i&\sim\mathcal{N}(0,\sigma^2) \end{align*}
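A minimal base-R sketch of this linear model on simulated data (the ages, coefficients and error SD below are invented for illustration; the real analysis uses the dbbmi data):

```r
set.seed(42)
n   <- 200
age <- runif(n, 10, 20)                        # simulated ages of boys
bmi <- 10.9 + 0.58 * age + rnorm(n, sd = 2.5)  # assumed beta0, beta1, sigma

fit <- lm(bmi ~ age)   # ordinary least squares fit of y = beta0 + beta1*x + eps
coef(fit)              # estimates of beta0 and beta1
sigma(fit)             # estimate of sigma
```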
Select the response distribution
Select covariates for all of the model parameters of the chosen distribution
Check model fit
If the fit is unsatisfactory, go back to 1.
This is a complex process
gamlss
gamlss() fits a distribution to the response using penalized maximum likelihood (RS, CG or mixed algorithms)
m <- gamlss(y ~ x1 + x2,
            sigma.formula =~ x1 + x2,
            nu.formula =~ x2 + x3,
            tau.formula =~ x2 + x3,
            data = dataset,
            family = NO,
            n.cyc = 20, ...)
histDist() plots a histogram of the response with the specified distribution
fitDist() fits a set of distributions to the response and chooses the one with the smallest GAIC (marginal fit)
chooseDist() fits a set of distributions on the specified model and chooses the one with the smallest GAIC (conditional fit)
gamlss2
gamlss2() fits a distribution to the response using penalized maximum likelihood (RS, CG or mixed algorithms)
m <- gamlss2(y ~ x1 + x2 | . | x2 + x3 | .,
             data = dataset,
             family = NO,
             maxit = 20, ...)
find_family() fits a set of distributions on an intercept-only (i.e. marginal) model and chooses the one with the smallest GAIC
find_gamlss2(formula, families = available_families(type="continuous"), k = 2, select = FALSE, ...) selects a distribution conditional on formula according to GAIC
find_gamlss2(formula, families = available_families(type="continuous"), k = 2, select = TRUE, ...) selects a distribution and variables according to GAIC
\begin{split} \text{GDEV}\,=& -2\ell(\boldsymbol{\hat{\theta}}) \\ =& -2 \sum_{i=1}^n \log f(y_i |\boldsymbol{\hat{\theta}}) \end{split}
\begin{split} {\rm AIC} &= \text{GDEV}+2 \, \text{df}\\ {\rm BIC} &= \text{GDEV}+ \log n \cdot \text{df} \end{split}
\text{GAIC}(k) = \text{GDEV} + k \cdot \text{df}
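These identities are easy to verify by hand in base R for an intercept-only normal model (simulated data; df = 2 for the two parameters mu and sigma). GAIC(2) recovers the AIC and GAIC(log n) the BIC:

```r
set.seed(1)
y <- rnorm(100, mean = 5, sd = 2)
n <- length(y)

# ML estimates for a normal with constant mu and sigma (df = 2)
mu.hat    <- mean(y)
sigma.hat <- sqrt(mean((y - mu.hat)^2))   # ML (divide-by-n) estimate
df <- 2

# GDEV = -2 * sum of log-densities at the fitted parameters
gdev <- -2 * sum(dnorm(y, mu.hat, sigma.hat, log = TRUE))
gaic <- function(k) gdev + k * df

gaic(2)        # = AIC
gaic(log(n))   # = BIC
```

Both values match R's built-in AIC() and BIC() for the equivalent intercept-only lm fit.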
GAMLSS-RS iteration 1: Global Deviance = 16958.7855 eps = 0.069449
GAMLSS-RS iteration 2: Global Deviance = 16958.7855 eps = 0.000000
Call:
gamlss2(formula = bmi ~ age, data = dbbmi10, family = NO)
---
Family: NO
Link function: mu = identity, sigma = log
*--------
Coefficients:
Estimate Std. Error t value Pr(>|t|)
mu.(Intercept) 10.87058 0.22405 48.52 <2e-16 ***
mu.age 0.57843 0.01487 38.90 <2e-16 ***
sigma.(Intercept) 0.90673 0.01171 77.43 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
*--------
n = 3646 df = 3 res.df = 3643
Deviance = 16958.7855 Null Dev. Red. = 6.94%
AIC = 16964.7855 elapsed = 0.05sec
Clearly the normal linear model provides a very bad fit
We need to find another distribution for BMI
The following code chooses the best-fitting continuous distribution using the AIC (marginal fit)
f <- find_family(dbbmi10$bmi, k = 2,
                 families = available_families(type = "continuous"),
                 verbose = FALSE)
(per-family progress output omitted; several families failed to fit and returned errors)
GAF IGAMMA PARETO1 PARETO PARETO2o GP PARETO2 EXP
90191.43 55458.11 37253.67 36755.01 31709.55 29273.12 29273.12 28931.35
WEI3 WEI GU WEI2 LQNO PE SN1 NOF
20821.08 20507.78 19819.34 19694.15 19136.73 18399.27 18230.48 18230.47
NO2 NO PE2 LO TF2 TF GT SEP2
18228.47 18228.47 18191.22 18167.88 18154.13 18154.08 18133.24 18091.67
SEP JSUo GA ST4 ST2 ST1 SEP1 ST5
18091.67 18032.16 18008.99 17995.27 17990.73 17974.12 17967.20 17964.20
EGB2 GB2 IG LOGNO LOGNO2 SHASHo SHASH exGAUS
17956.21 17955.96 17940.46 17940.32 17940.32 17933.34 17922.92 17920.96
RG SN2 SHASHo2 SEP3 ST3 ST3C GIG SST
17917.08 17915.17 17914.67 17912.58 17906.96 17906.96 17905.94 17905.84
BCTo BCT JSU BCCG BCCGo SEP4 GG BCPE
17886.13 17886.13 17885.20 17884.13 17884.13 17883.73 17883.39 17882.88
BCPEo
17882.88
The chosen distribution is BCPEo (Box-Cox Power Exponential):
Family: c("BCPEo", "Box-Cox Power Exponential-orig.")
Fitting method: "nlminb"
Call: gamlssML(formula = dbbmi10$bmi, family = "BCPEo")
Mu Coefficients:
[1] 2.948
Sigma Coefficients:
[1] -1.929
Nu Coefficients:
[1] -0.7292
Tau Coefficients:
[1] 0.7624
Degrees of Freedom for the fit: 4 Residual Deg. of Freedom 3642
Global Deviance: 17874.9
AIC: 17882.9
SBC: 17907.7
It makes more sense to choose the best-fitting conditional distribution.
We specify age as the covariate for all distribution parameters
I am using the gamlss functions here because gamlss2 gave errors.
m0 <- gamlss(bmi ~ age,
             sigma.formula =~ age,
             nu.formula =~ age,
             tau.formula =~ age,
             family = NO,
             data = dbbmi10,
             n.cyc = 100)
GAMLSS-RS iteration 1: Global Deviance = 16945.38
GAMLSS-RS iteration 2: Global Deviance = 16945.38
GAMLSS-RS iteration 3: Global Deviance = 16945.38
minimum GAIC(k= 2 ) family: exGAUS
minimum GAIC(k= 3.84 ) family: exGAUS
minimum GAIC(k= 8.2 ) family: exGAUS
X2 X3.84 X8.2
exGAUS 16288.32 16299.36 16325.52
EGB2 16294.60 16309.32 16344.20
SEP4 16295.21 16309.93 16344.81
JSU 16295.59 16310.31 16345.19
SHASH 16298.27 16312.99 16347.87
GB2 16299.84 16314.56 16349.44
BCPE 16301.72 16316.44 16351.32
ST1 16302.12 16316.84 16351.72
JSUo 16302.78 16317.50 16352.38
BCT 16302.97 16317.69 16352.57
ST5 16303.81 16318.53 16353.41
ST2 16304.41 16319.13 16354.01
BCCG 16304.43 16315.47 16341.63
BCPEo 16304.79 16319.51 16354.39
BCTo 16306.36 16321.08 16355.96
BCCGo 16307.62 16318.66 16344.82
GG 16311.38 16322.42 16348.58
RG 16318.55 16325.91 16343.35
SHASHo2 16318.85 16333.57 16368.45
SHASHo 16318.86 16333.58 16368.46
SST 16323.33 16338.05 16372.93
ST3 16324.00 16338.72 16373.60
SEP1 16328.94 16343.66 16378.54
SEP3 16330.91 16345.63 16380.51
SEP2 16331.19 16345.91 16380.79
SN1 16377.49 16388.53 16414.69
ST4 16387.83 16402.55 16437.43
SN2 16426.90 16437.94 16464.10
IGAMMA 16464.82 16472.18 16489.62
LOGNO2 16553.75 16561.11 16578.55
LNO 16553.75 16561.11 16578.55
LOGNO 16553.75 16561.11 16578.55
IG 16557.30 16564.66 16582.10
GT 16652.42 16667.14 16702.02
TF2 16658.73 16669.77 16695.93
TF 16659.35 16670.39 16696.55
GA 16666.47 16673.83 16691.27
LO 16695.28 16702.64 16720.08
NET 16703.89 16711.25 16728.69
PE 16723.42 16734.46 16760.62
PE2 16723.84 16734.88 16761.04
NO 16953.38 16960.74 16978.18
WEI 17978.46 17985.82 18003.26
WEI3 17978.51 17985.87 18003.31
WEI2 18177.15 18184.51 18201.95
GU 18980.03 18987.39 19004.83
EXP 28908.71 28912.39 28921.11
PARETO2o 28963.20 28970.56 28988.00
PARETO2 28984.92 28992.28 29009.72
GP 28984.92 28992.28 29009.72
GIG NA NA NA
GAMLSS-RS iteration 1: Global Deviance = 16748.3316 eps = 0.249466
GAMLSS-RS iteration 2: Global Deviance = 16595.5891 eps = 0.009119
GAMLSS-RS iteration 3: Global Deviance = 16468.0303 eps = 0.007686
GAMLSS-RS iteration 4: Global Deviance = 16380.1908 eps = 0.005333
GAMLSS-RS iteration 5: Global Deviance = 16327.6312 eps = 0.003208
GAMLSS-RS iteration 6: Global Deviance = 16300.2541 eps = 0.001676
GAMLSS-RS iteration 7: Global Deviance = 16287.0502 eps = 0.000810
GAMLSS-RS iteration 8: Global Deviance = 16281.0575 eps = 0.000367
GAMLSS-RS iteration 9: Global Deviance = 16278.385 eps = 0.000164
GAMLSS-RS iteration 10: Global Deviance = 16277.2164 eps = 0.000071
GAMLSS-RS iteration 11: Global Deviance = 16276.7055 eps = 0.000031
GAMLSS-RS iteration 12: Global Deviance = 16276.4822 eps = 0.000013
GAMLSS-RS iteration 13: Global Deviance = 16276.3889 eps = 0.000005
Call:
gamlss2(formula = bmi ~ age | age | age, data = dbbmi10, family = exGAUS)
---
Family: exGAUS
Link function: mu = identity, sigma = log, nu = log
*--------
Coefficients:
Estimate Std. Error t value Pr(>|t|)
mu.(Intercept) 8.731104 0.276086 31.625 < 2e-16 ***
mu.age 0.581581 0.019263 30.192 < 2e-16 ***
sigma.(Intercept) -0.583069 0.160032 -3.643 0.000273 ***
sigma.age 0.056686 0.010409 5.446 5.50e-08 ***
nu.(Intercept) 0.760758 0.152497 4.989 6.36e-07 ***
nu.age -0.001436 0.010458 -0.137 0.890780
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
*--------
n = 3646 df = 6 res.df = 3640
Deviance = 16276.3889 Null Dev. Red. = 9.15%
AIC = 16288.3889 elapsed = 1.01sec
Recall the GAMLSS model
\begin{align*} y_{i}&\sim \textcolor{red}{\mathcal{D}\left(\theta_{i1},\ldots,\theta_{iK}\right)} \\ \textcolor{green}{g_k(\theta_{ik})}&=\beta_0^{k}+\textcolor{blue}{s_1^{k}(x_{i1})+\ldots+s_p^{k}( x_{ip})}\qquad k=1,\ldots,K \end{align*}
Assume we have chosen \mathcal{D}\left(\theta_{i1},\ldots,\theta_{iK}\right)
We now need to choose the terms s_1^{k}(x_{1}),\ldots,s_p^{k}( x_{p}) for each distribution parameter \theta_k.
We firstly define the options for the terms s^k_j(x_j)
Then we discuss strategies for model selection
parametric, e.g. x_j, \log(x_j), x_j^2, etc.
smooth term (splines)
random effect
spatial effect
All of these terms can be expressed as linear combinations of basis functions
s({x}_i)=\sum_{\ell=1}^L \gamma_\ell B_\ell ({x}_i)
where
B_\ell ({x}_i) represent different types of basis functions
\gamma_\ell are the basis amplitudes
In matrix notation, each of the predictors can then be written for all observations as
\boldsymbol{\eta}=\beta_0 \mathbf{1}_n+\boldsymbol{B}_1 \boldsymbol{\gamma}_1+\ldots+\boldsymbol{B}_{p} \boldsymbol{\gamma}_{p} .
s(x)=\sum_{\ell=1}^L \gamma_\ell B_\ell(x)
\Rightarrow Add a regularization term that penalizes function estimates with high variability.
\operatorname{pen}(s)=\lambda \int\left(s^{\prime \prime}(x)\right)^2 d x
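The effect of the penalty can be sketched with a small hand-rolled P-spline in base R: a B-spline basis B (from the splines package shipped with R) and a second-order difference penalty matrix D as a discrete stand-in for \int (s''(x))^2 dx. The simulated data and the two lambda values are arbitrary choices for illustration, not part of the lecture's analysis:

```r
library(splines)   # recommended package, ships with R

set.seed(2)
x <- seq(0, 1, length.out = 200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)   # simulated noisy curve

B <- bs(x, df = 20, intercept = TRUE)      # B-spline basis matrix (n x L)
D <- diff(diag(ncol(B)), differences = 2)  # 2nd-order difference penalty

pspline_fit <- function(lambda) {
  # penalized least squares: (B'B + lambda D'D) gamma = B'y
  gamma <- solve(crossprod(B) + lambda * crossprod(D), crossprod(B, y))
  B %*% gamma
}

s.wiggly <- pspline_fit(0.001)  # small penalty: rough, close-fitting curve
s.smooth <- pspline_fit(100)    # large penalty: much smoother curve
```

Increasing lambda trades goodness of fit for smoothness: the heavily penalized fit has far smaller squared second differences, while the lightly penalized fit tracks the data more closely.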
| Term | gamlss | gamlss2 |
|---|---|---|
| parametric | x1 + I(x2^2) + log(x3) | x1 + I(x2^2) + log(x3) |
| nonlinear (spline) | pb(x1) | s(x1) |
| cyclic spline | pbc(x1) | |
| random intercept | re(random =~ 1\|id) | s(id, bs="re") |
| random slope | re(random =~ x1\|id) | s(id, x1, bs="re") |
Fit a BCTo model with only a smooth term for age in \mu (just to illustrate P-splines)
Compute fitted values \hat\mu (\mu is the median of the BCTo distribution)
Vary the penalty in the spline (df) to see the effect
m1 <- gamlss(bmi ~ pb(age), data=dbbmi, family=BCTo, trace = FALSE) # default penalty
m2 <- gamlss(bmi ~ pb(age, df=3), data=dbbmi, family=BCTo, trace = FALSE)
m3 <- gamlss(bmi ~ pb(age, df=15), data=dbbmi, family=BCTo, trace = FALSE)
# compute fitted values for mu
p1 <- predict(m1, what="mu", type = "response")
p2 <- predict(m2, what="mu", type = "response")
p3 <- predict(m3, what="mu", type = "response")
dbbmi <- cbind(dbbmi, p1, p2, p3)
ggplot(dbbmi, aes(x=age, y=bmi)) +
geom_point(shape = 21, fill = "lightgray", color = "black", size = 0.05) +
geom_smooth(aes(x=age, y=p1), size=0.5, colour = "blue") +
geom_smooth(aes(x=age, y=p2), colour = "red", size=0.5) +
geom_smooth(aes(x=age, y=p3), colour = "green", size=0.5) +
theme_bw()